Intro
As a long-time Hacker News lurker/reader, I’ve been captivated by the site for many years. I finally decided to get more experience interacting with REST APIs and to explore scraping stories from Hacker News. Fortunately, they offer a convenient API for free!
The Hacker News API is described at this link:
https://github.com/HackerNews/API?tab=readme-ov-file
What is an API?
API stands for Application Programming Interface, and the word “interface” is the most important part. The interface to any complex machine is how we interact with it, control it, and receive feedback from it. The interface to a car is the steering wheel, stick & pedals, and dash instrumentation; how the engine accelerates/decelerates, transfers power to the wheels, and steers them is largely abstracted away from us. We get a nice set of controls with which to interface with the vehicle.
Similarly, today we have APIs that we can use to retrieve information and resources digitally and to interface with applications, whether via a function call, a class instance, or a request to a web API.
Modern web APIs differ quite a bit in their design, with many adopting the REST architecture. REST stands for REpresentational State Transfer; it was described by Roy Fielding in his 2000 PhD thesis, though he had already been instrumental in the development of several key specifications prior to that, such as HTTP 1.0 & 1.1 and URI. This architectural style would prove essential to serving the internet at scale.
There are 6 guiding constraints of REST that, when applied to a system’s architecture, give it performance, scalability, simplicity, modifiability, visibility, portability, and reliability. @ref-rest_wiki
These 6 constraints are:
- Client/Server – Clients are separated from servers by a well-defined interface (Simplicity, Modifiability)
- Stateless – A specific client does not consume server storage when it is “at rest” (Visibility, Reliability, Scalability)
- Cache – Responses indicate their own cacheability (Scalability, Performance)
- Uniform interface – All resources are identified and manipulated in the same, self-descriptive way (Simplicity, Visibility)
- Layered system – A client cannot ordinarily tell whether it is connected directly to the end server, or to an intermediary along the way (Modifiability, Scalability)
- Code on demand (optional) – Servers are able to temporarily extend or customize the functionality of a client by transferring logic to the client that can be executed within a standard virtual machine (Modifiability)
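To make the statelessness constraint concrete, here is a minimal sketch (the handler and request shape are mine, not from any real framework) of a server that keeps no per-client state: every request carries everything the server needs, so any server replica can answer it.

```python
# Sketch: a stateless request handler. Each request carries its own
# context (here, an Authorization value), so the server stores nothing
# between calls and consults no hidden session.

def handle_request(request: dict) -> dict:
    # All state arrives with the request itself.
    user = request.get("headers", {}).get("Authorization", "anonymous")
    return {"status": 200, "body": f"hello, {user}"}

req = {"method": "GET", "path": "/", "headers": {"Authorization": "alice"}}
# Two identical requests produce identical responses -- nothing on the
# server side is mutated between them.
assert handle_request(req) == handle_request(req)
```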
Today, most web APIs use RESTful HTTP to exchange information. Almost all network traffic to and from a web browser is conducted over the HTTP(S) protocol, although this is hard to quantify precisely because some traffic uses other streaming protocols that run over TCP but not HTTP. HTTP (Hypertext Transfer Protocol) defines various verbs used in the exchange of information over the internet, and if you’re aware of the OSI model, it falls into the application layer, the topmost layer of the internet protocol suite. (If you want to go down the rabbit hole of understanding data encapsulation in the modern internet stack, this video seems to do a good job: https://www.youtube.com/watch?v=P5jC8D5zndc)
These verbs are: PUT, POST, GET, DELETE, HEAD, CONNECT, OPTIONS, TRACE, and PATCH. You really only need to know a few of these to make good use of HTTP.
GET is used to request a resource from an HTTP server with limited processing and no changes to server state. The request can be communicated entirely in the URL, and the nice thing is that the network can benefit from caching, significantly improving latency and response time.
Such systems include Content Delivery Networks, which cache content regionally and deliver it with lower latency, and in-memory key-value stores such as Redis, which can serve a resource with faster response times.
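The benefit of cacheable GETs can be sketched in a few lines: once a response for a URL is cached, the slow trip to the origin server is skipped entirely. The fetch function below is a stand-in for a real network call, and the dictionary plays the role a CDN or Redis would play in production.

```python
# Sketch of response caching for GET requests. `fetch_from_origin` is a
# stand-in for a real HTTP round trip; `cache` stands in for a CDN edge
# node or a Redis instance.

cache = {}
origin_hits = 0  # counts how often we actually reach the origin server

def fetch_from_origin(url: str) -> str:
    global origin_hits
    origin_hits += 1
    return f"<html>content of {url}</html>"

def cached_get(url: str) -> str:
    if url not in cache:          # cache miss: go to the origin once
        cache[url] = fetch_from_origin(url)
    return cache[url]             # cache hit: no network round trip

cached_get("https://example.com/")
cached_get("https://example.com/")  # second call is served from cache
```

Because GET promises not to change server state, intermediaries can safely reuse the cached copy for every client that asks for the same URL.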
POST is another request method; it requests that some data be processed, with the exact semantics depending on the application.
HTTP requests follow a specified format. The first line looks like:
<case-sensitive request method><space><request target><space><HTTP version><carriage return><new-line>
followed by one or more header fields of the form:
<field-name>:<optional space><field-value>
These header fields specify values such as which hostname the server should use; this supports multiple virtual hosts on a single server.
Here’s an example HTTP GET request:
GET / HTTP/1.1\r\n
Host: example.com\r\n
Upgrade-Insecure-Requests: 1\r\n
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15\r\n
Accept-Language: en-US,en;q=0.9\r\n
Accept-Encoding: gzip, deflate\r\n
Connection: keep-alive\r\n
\r\n
This request is sent over the established network connection to the server as plain text; the various fields are interpreted by the web server, and an appropriate response is sent back.
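Since the request really is just text, assembling one by hand takes only a few lines. This sketch builds the same kind of request line and headers shown above; it stops short of opening a socket, but the resulting bytes are exactly what would be written to one.

```python
# Build a raw HTTP/1.1 GET request as bytes. Header name and value are
# joined with ": ", every line ends in CRLF, and a blank line terminates
# the header section.
from typing import Dict, Optional

def build_get_request(host: str, path: str = "/",
                      headers: Optional[Dict[str, str]] = None) -> bytes:
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    lines.append("")   # blank line ends the headers
    lines.append("")   # yields the final trailing CRLF
    return "\r\n".join(lines).encode("ascii")

raw = build_get_request("example.com", "/", {"Connection": "keep-alive"})
# `raw` could now be written verbatim to a TCP connection on port 80.
```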
This is essentially how modern web connections are conducted. A user clicks or enters a URL, and the browser establishes a connection to the web server. This usually initiates an HTTP GET request, and the web resource is returned to the client browser; oftentimes this is an HTML page, a JSON file, or some other media.
A Usable Python API for Hacker News
A big part of software is developing APIs for interfacing with some application, and this often involves building upon lower-level APIs to make it easier to do what you want to do. In this case, working directly with the URLs to get specific items from the Hacker News API is quite cumbersome, so I’d like to develop my own higher-level API to handle a lot of the URL manipulation for me.
Every “resource” in the Hacker News API is represented by a unique ID. You can use this ID along with the known URL schema to retrieve resources from the server. Example resources are stories, comments, and special meta-threads on Hacker News. A story resource contains references to the IDs of its top-level child comments, and those comments in turn reference further child comments. Knowing this scheme, we can crawl the entire tree of comments for a specific story. With this information, we can build up an entire API for doing things such as: getting the nth top story threads, getting all the top-level comments to a story, getting all the nth-level comments and/or the sibling comments, or getting the entire comment tree for a specific story.
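The crawl described above can be sketched as a recursive walk over each item’s child IDs. The `kids` field and the item-URL schema are the ones documented in the HN API README; the fetch function is injected here so the traversal logic can be shown (and tested) without network access.

```python
# Recursively collect every comment ID under an item. `fetch_item` is
# any callable mapping an item ID to its JSON dict -- in real use it
# would GET https://hacker-news.firebaseio.com/v0/item/<id>.json
from typing import Callable, Dict, List

def collect_comment_ids(item_id: int,
                        fetch_item: Callable[[int], dict]) -> List[int]:
    item = fetch_item(item_id)
    ids: List[int] = []
    for kid in item.get("kids", []):   # "kids" holds child comment IDs
        ids.append(kid)
        ids.extend(collect_comment_ids(kid, fetch_item))
    return ids

# A tiny fake item tree standing in for live API responses:
fake_items: Dict[int, dict] = {
    1: {"id": 1, "type": "story", "kids": [2, 3]},
    2: {"id": 2, "type": "comment", "kids": [4]},
    3: {"id": 3, "type": "comment"},
    4: {"id": 4, "type": "comment"},
}
```

Passing `fake_items.get` as the fetcher walks the whole tree depth-first: story 1 yields comments 2, 4, and 3 in that order.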
Here’s an example of the functions that make up my own API:
from typing import List, Optional, Union

# story_id and story_obj are type aliases defined elsewhere in the module
def get_top_stories(method, *args, **kwargs) -> Optional[List[Union[story_id, story_obj]]]:
    method = method.upper()
    if method == 'API':
        return get_stories_from_api(get_top_stories_url())
    elif method == 'FILE':
        return get_stories_from_file(*args, **kwargs)
    return None
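The 'FILE' branch assumes helpers that save and load story IDs from disk. The post doesn’t show their implementations, so this is just one plausible shape, using JSON on the local file system; `save_stories_to_file` and the default path are my own names, not necessarily the post’s.

```python
# Hedged sketch of the file-backed helpers referenced above: persist the
# list of story IDs as JSON, and read it back if a cached copy exists.
import json
from pathlib import Path
from typing import List, Optional

def save_stories_to_file(story_ids: List[int],
                         path: str = "top_stories.json") -> None:
    Path(path).write_text(json.dumps(story_ids))

def get_stories_from_file(path: str = "top_stories.json") -> Optional[List[int]]:
    p = Path(path)
    if not p.exists():
        return None          # no cached copy on disk yet
    return json.loads(p.read_text())
```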
This is a method for getting the current top stories, or for getting ‘cached’ top stories that have been saved off onto the file system.
- Intro to Hacker News API
- Building Up an API for retrieving information
- Sentiment Analysis
- Topic Analysis
- Number of Different Topics
- Breakdown of Industries